Main Question
What variables have significant effect on life expectancy at age 60?
Group Name: wallaby
Group members: Miaoyang Kong(mkong22@wisc.edu), Jiaqi Guo(jguo288@wisc.edu), Yutong Wei(ywei88@wisc.edu)
We became interested in life expectancy after reading a New York Times article, "U.S. Life Expectancy Falls Again in 'Historic' Setback." In recent years, concern about the length of the human life span has been growing, and many countries are seeing their populations age. We study life expectancy by estimating the effects of six explanatory variables on it.
Source Data
The life expectancy data are collected by the World Health Organization. Because the statistics for life expectancy at age 60 are estimated for the years 2000, 2010, 2015, and 2019, we concentrate on these four years. Life expectancy at age 60 reflects the overall mortality level of the population over 60 years old and summarizes the mortality pattern that prevails across all age groups above 60. We chose life expectancy at age 60 rather than at birth because we wish to rule out the impact of newborn mortality. We therefore focus on mortality from various diseases and investigate which disease has the greater impact on life expectancy. We are also interested in how factors such as income and education affect life expectancy.
For the predictors, we choose the number of deaths caused by tuberculosis, the number of deaths caused by noncommunicable diseases (NCD), and the prevalence of undernourishment to see how specific diseases or unhealthy conditions may affect life expectancy; the suicide rate to see how suicide may affect life expectancy; the tertiary school enrollment rate to see how education level may affect life expectancy; and the per adult national income to see how economic level may affect life expectancy. We select the observations of the predictors from the same four years: 2000, 2010, 2015, and 2019.
| Name | Description |
|---|---|
| life_expectancy (response) | Life expectancy at age 60 (years) |
| tuberculousis | Estimated number of deaths due to tuberculosis, excluding HIV |
| NCD | Number of deaths attributed to non-communicable diseases (in thousands) |
| income | Per adult national income |
| suicide | Age-standardized suicide rate (per 100 000 population) |
| education | School enrollment, tertiary (% gross) |
| undernourishment | Prevalence of undernourishment (% of population) |
We omit observations with missing values, because some countries lack data on undernourishment and education.
library(readr)  # read_csv()
library(dplyr)  # %>% pipe
total <- read_csv("merge.csv")
total[total < 0] <- NA  # treat negative entries as missing
total <- total %>% na.omit()
total %>% head(10)
## # A tibble: 10 × 9
## Country Year Life expect…¹ Numbe…² Age-s…³ Total…⁴ Per a…⁵ Schoo…⁶ Preva…⁷
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albania 2000 19 23 5.23 656. 21554. 15.5 4.9
## 2 Albania 2010 21.3 9 7.63 506. 37419. 44.5 5.8
## 3 Albania 2015 21.1 8 4.23 514. 40068. 62.0 4.9
## 4 Albania 2019 21.0 8 3.72 602. 43732. 59.8 4.3
## 5 Algeria 2010 21.4 3000 3 488. 26267. 29.9 4.3
## 6 Algeria 2015 21.8 3200 2.72 461. 28058. 36.8 2.8
## 7 Algeria 2019 22.0 2800 2.6 446. 25059. 52.6 2.5
## 8 Angola 2015 16.7 16000 13.3 639. 13326. 8.40 14.5
## 9 Argentina 2000 20.2 890 9.2 540. 17885. 54.0 3
## 10 Argentina 2010 20.6 580 8.43 491. 37678. 73.2 3.1
## # … with abbreviated variable names ¹`Life expectancy`,
## # ²`Number of death due to tuberculosis, excluding HIV`,
## # ³`Age-standarized suicide rates (per 100000 population)`,
## # ⁴`Total NCD Death(inthousands)`, ⁵`Per adult national income`,
## # ⁶`School enrollment, tertiary (% gross)`,
## # ⁷`Prevalence of undernourishment (% of population)`
print(total)
## # A tibble: 409 × 9
## Country Year Life expect…¹ Numbe…² Age-s…³ Total…⁴ Per a…⁵ Schoo…⁶ Preva…⁷
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albania 2000 19 23 5.23 656. 21554. 15.5 4.9
## 2 Albania 2010 21.3 9 7.63 506. 37419. 44.5 5.8
## 3 Albania 2015 21.1 8 4.23 514. 40068. 62.0 4.9
## 4 Albania 2019 21.0 8 3.72 602. 43732. 59.8 4.3
## 5 Algeria 2010 21.4 3000 3 488. 26267. 29.9 4.3
## 6 Algeria 2015 21.8 3200 2.72 461. 28058. 36.8 2.8
## 7 Algeria 2019 22.0 2800 2.6 446. 25059. 52.6 2.5
## 8 Angola 2015 16.7 16000 13.3 639. 13326. 8.40 14.5
## 9 Argentina 2000 20.2 890 9.2 540. 17885. 54.0 3
## 10 Argentina 2010 20.6 580 8.43 491. 37678. 73.2 3.1
## # … with 399 more rows, and abbreviated variable names ¹`Life expectancy`,
## # ²`Number of death due to tuberculosis, excluding HIV`,
## # ³`Age-standarized suicide rates (per 100000 population)`,
## # ⁴`Total NCD Death(inthousands)`, ⁵`Per adult national income`,
## # ⁶`School enrollment, tertiary (% gross)`,
## # ⁷`Prevalence of undernourishment (% of population)`
After observing non-linearity in the original exploratory plot, we decided to apply a log transformation to `Number of death due to tuberculosis, excluding HIV`, `Total NCD Death(inthousands)`, `Per adult national income`, `Age-standarized suicide rates (per 100000 population)`, and `Prevalence of undernourishment (% of population)`. Additionally, we applied a square root transformation to `School enrollment, tertiary (% gross)`.
This is the exploratory plot of our transformed dataset. We observe a positive relationship between life expectancy and income, and between life expectancy and tertiary school enrollment. In addition, we observe a negative relationship between life expectancy and NCD deaths.
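# Natural-log transform every numeric column (Country and Year are dropped)
total_transform = log(total[-(1:2)])
# The response stays untransformed: undo the natural log with exp(), its inverse
total_transform$`Life expectancy` = exp(total_transform$`Life expectancy`)
# Education gets a square root instead of a log: undo the log, then take sqrt()
total_transform$`School enrollment, tertiary (% gross)` = sqrt(exp(total_transform$`School enrollment, tertiary (% gross)`))
pairs(total_transform)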
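To illustrate why a log transformation helps with non-linearity, here is a minimal sketch on simulated data. The variables `x_sim` and `y_sim` are hypothetical and are not taken from `merge.csv`; the point is only that the Pearson correlation, which measures linear association, is much stronger on the log scale when the response depends on the logarithm of a skewed predictor.

```r
# Simulated data (hypothetical): the response is linear in log(x_sim)
set.seed(1)
x_sim <- exp(runif(200, min = 0, max = 10))            # strongly right-skewed predictor
y_sim <- 20 - 0.5 * log(x_sim) + rnorm(200, sd = 0.3)  # response with noise

# Pearson correlation captures linear association only
r_raw <- cor(x_sim, y_sim)       # predictor on its raw scale
r_log <- cor(log(x_sim), y_sim)  # predictor after the log transformation

c(raw = r_raw, log = r_log)
```

The correlation on the log scale should be clearly larger in absolute value, which is the pattern we look for in the exploratory plots.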
We produce a matrix of scatterplots to visualize the correlation between variables. The upper-right panels show the scatterplot of each pair with a fitted trend line, and the lower-left panels show the Pearson correlation and its significance.
library(GGally)
ggpairs(total_transform[is.finite(rowSums(total_transform)),], lower = list(continuous = "cor", combo = "box_no_facet", discrete = "facetbar",na="na"),
upper = list(continuous = wrap("smooth", alpha = 0.3, size=0.2)))
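The correlation panels report a Pearson correlation together with significance stars. As a rough sketch of the underlying calculation, base R's `cor.test()` returns both the estimate and the p-value for a single pair. The data below are simulated and the names `income_sim` and `life_sim` are hypothetical.

```r
# Simulated pair of variables (hypothetical, for illustration only)
set.seed(42)
income_sim <- rnorm(100)
life_sim   <- 0.8 * income_sim + rnorm(100, sd = 0.5)

# Pearson correlation with a test of H0: correlation = 0
ct <- cor.test(income_sim, life_sim, method = "pearson")

ct$estimate  # sample Pearson correlation r
ct$p.value   # p-value of the two-sided test
```

A small p-value here corresponds to the stars shown next to a correlation in the plot matrix.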
We fit a multiple linear regression model and include interaction effects. We also check VIF values for multicollinearity, which helps us identify the variables that have a significant effect on life expectancy.
Multiple Linear Regression Model
Since we want to find the relationship between life expectancy (the dependent variable, or response) and the six factors we are interested in (the independent variables, or predictors), namely tuberculousis, NCD, income, suicide, education, and undernourishment, we select the multiple linear regression model.
#Define the response and predictors:
life_expectancy =total$`Life expectancy`
tuberculousis =total$`Number of death due to tuberculosis, excluding HIV`
NCD = total$`Total NCD Death(inthousands)`
income = total$`Per adult national income`
suicide = total$`Age-standarized suicide rates (per 100000 population)`
education=total$`School enrollment, tertiary (% gross)`
undernourishment = total$`Prevalence of undernourishment (% of population)`
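We transform the independent variables using 'log' and 'sqrt' to meet the linearity assumption. Note that tuberculousis uses a base-10 logarithm (log10), while the other logged variables use the natural logarithm (log).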
log_tuberculousis=log10(tuberculousis)
log_NCD = log(NCD)
log_income=log(income)
log_suicide=log(suicide)
sqrt_education=sqrt(education)
log_undernourishment=log(undernourishment)
To hold the transformed independent variables, we create a new data frame. We then drop incomplete rows from it: the logarithm of a non-positive value is undefined (NaN or -Inf), and one observation in the income column becomes undefined after the income is "logged."
new_data=data.frame(life_expectancy,log_tuberculousis,log_NCD,log_income,log_suicide,sqrt_education,log_undernourishment)
new_data[is.na(new_data) | new_data == "-Inf"] <- NA
new_data=new_data %>% na.omit()
new_data %>%
head(10)
## life_expectancy log_tuberculousis log_NCD log_income log_suicide
## 1 19.00 1.3617278 6.486008 9.978319 1.6544113
## 2 21.31 0.9542425 6.226339 10.529923 2.0320878
## 3 21.13 0.9030900 6.242029 10.598345 1.4422020
## 4 21.03 0.9030900 6.400091 10.685847 1.3137237
## 5 21.37 3.4771213 6.189905 10.176057 1.0986123
## 6 21.81 3.5051500 6.132747 10.242027 1.0006319
## 7 22.04 3.4471580 6.099870 10.129001 0.9555114
## 8 16.71 4.2041200 6.460061 9.497503 2.5855058
## 9 20.18 2.9493900 6.291939 9.791727 2.2192035
## 10 20.62 2.7634280 6.197055 10.536835 2.1317968
## sqrt_education log_undernourishment
## 1 3.941651 1.5892352
## 2 6.674523 1.7578579
## 3 7.874492 1.5892352
## 4 7.731656 1.4586150
## 5 5.467123 1.4586150
## 6 6.064760 1.0296194
## 7 7.253960 0.9162907
## 8 2.898437 2.6741486
## 9 7.346074 1.0986123
## 10 8.557535 1.1314021
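The reason for cleaning the data after the transformation can be seen in a small sketch. The vector `x_demo` is hypothetical: it contains one zero and one negative value to show that `log()` returns `-Inf` for zero and `NaN` for a negative number, and that `is.na()` alone does not catch `-Inf`.

```r
# Hypothetical values: one zero and one negative entry
x_demo <- c(10, 0, -5, 100)

# log(-5) raises a warning and yields NaN; log(0) yields -Inf
lx <- suppressWarnings(log(x_demo))
lx

is.na(lx)      # TRUE only for the NaN entry
is.finite(lx)  # FALSE for both the -Inf and the NaN entry

# Keep only the entries that are usable in a regression
clean <- lx[is.finite(lx)]
clean
```

This is why the data frame needs an explicit check for `-Inf` in addition to the usual missing-value check.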
lmmodel = lm(life_expectancy ~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment, data = new_data)
summary(lmmodel)
##
## Call:
## lm(formula = life_expectancy ~ log_tuberculousis + log_NCD +
## log_income + log_suicide + sqrt_education + log_undernourishment,
## data = new_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4601 -0.3456 -0.0102 0.4256 1.6335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.28159 0.88050 70.734 < 2e-16 ***
## log_tuberculousis -0.13508 0.02833 -4.768 2.60e-06 ***
## log_NCD -6.77359 0.11860 -57.113 < 2e-16 ***
## log_income 0.03669 0.03015 1.217 0.224
## log_suicide -0.26235 0.04385 -5.983 4.87e-09 ***
## sqrt_education 0.22116 0.02013 10.985 < 2e-16 ***
## log_undernourishment -0.35890 0.05465 -6.567 1.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5667 on 401 degrees of freedom
## Multiple R-squared: 0.9639, Adjusted R-squared: 0.9634
## F-statistic: 1784 on 6 and 401 DF, p-value: < 2.2e-16
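Because most predictors enter on a log scale, a coefficient is easier to read as the change in life expectancy associated with a multiplicative change in the original variable. The sketch below uses two estimates copied from the summary above; the interpretation assumes all other predictors are held fixed and describes an association, not a causal effect.

```r
# Estimates copied from the regression summary
beta_log_NCD <- -6.77359  # natural-log scale
beta_log_tb  <- -0.13508  # base-10 log scale

# Doubling NCD deaths changes log(NCD) by log(2)
effect_doubling_NCD <- beta_log_NCD * log(2)
effect_doubling_NCD   # about -4.70 years

# A tenfold increase in tuberculosis deaths changes log10(tb) by 1
effect_tenfold_tb <- beta_log_tb * log10(10)
effect_tenfold_tb     # -0.13508 years
```

The different log bases matter here: a coefficient on a base-10 log corresponds to a tenfold change, while a coefficient on a natural log corresponds to a change by a factor of e.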
Interaction Effect
To detect whether there are interaction effects between pairs of the variables we are interested in, we use AIC in a stepwise algorithm. This automated method adds or drops one term at a time and returns the model with the lowest AIC it finds along the search path.
library(MASS)
lm1 = lm(life_expectancy ~1, data = new_data)
lm2 = lm(life_expectancy ~ (.)^2, data = new_data)
lm.both = stepAIC(lm1, direction="both", scope=list(upper=lm2,lower=lm1))
## Start: AIC=886.51
## life_expectancy ~ 1
##
## Df Sum of Sq RSS AIC
## + log_NCD 1 3231.6 334.4 -77.22
## + sqrt_education 1 1965.2 1600.7 561.72
## + log_undernourishment 1 1597.5 1968.4 646.08
## + log_income 1 1508.7 2057.2 664.08
## + log_tuberculousis 1 926.6 2639.3 765.74
## + log_suicide 1 306.4 3259.6 851.86
## <none> 3566.0 886.51
##
## Step: AIC=-77.22
## life_expectancy ~ log_NCD
##
## Df Sum of Sq RSS AIC
## + sqrt_education 1 163.6 170.7 -349.51
## + log_undernourishment 1 139.4 195.0 -295.25
## + log_income 1 80.2 254.1 -187.17
## + log_tuberculousis 1 45.5 288.8 -134.96
## + log_suicide 1 3.3 331.0 -79.32
## <none> 334.4 -77.22
## - log_NCD 1 3231.6 3566.0 886.51
##
## Step: AIC=-349.51
## life_expectancy ~ log_NCD + sqrt_education
##
## Df Sum of Sq RSS AIC
## + log_NCD:sqrt_education 1 22.12 148.58 -404.14
## + log_undernourishment 1 20.59 150.11 -399.97
## + log_tuberculousis 1 14.87 155.83 -384.71
## + log_suicide 1 10.68 160.02 -373.87
## + log_income 1 5.81 164.89 -361.63
## <none> 170.70 -349.51
## - sqrt_education 1 163.65 334.35 -77.22
## - log_NCD 1 1430.05 1600.75 561.72
##
## Step: AIC=-404.14
## life_expectancy ~ log_NCD + sqrt_education + log_NCD:sqrt_education
##
## Df Sum of Sq RSS AIC
## + log_undernourishment 1 15.3192 133.26 -446.53
## + log_tuberculousis 1 13.2115 135.37 -440.13
## + log_income 1 6.6980 141.88 -420.96
## + log_suicide 1 4.0796 144.50 -413.49
## <none> 148.58 -404.14
## - log_NCD:sqrt_education 1 22.1201 170.70 -349.51
##
## Step: AIC=-446.53
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_NCD:sqrt_education
##
## Df Sum of Sq RSS AIC
## + sqrt_education:log_undernourishment 1 20.2196 113.04 -511.67
## + log_tuberculousis 1 9.0503 124.21 -473.23
## + log_suicide 1 6.2158 127.05 -464.02
## + log_income 1 1.5270 131.74 -449.23
## + log_NCD:log_undernourishment 1 0.9989 132.26 -447.60
## <none> 133.26 -446.53
## - log_undernourishment 1 15.3192 148.58 -404.14
## - log_NCD:sqrt_education 1 16.8451 150.11 -399.97
##
## Step: AIC=-511.67
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_NCD:sqrt_education + sqrt_education:log_undernourishment
##
## Df Sum of Sq RSS AIC
## + log_tuberculousis 1 7.5627 105.48 -537.92
## + log_income 1 1.5318 111.51 -515.24
## + log_suicide 1 1.4366 111.61 -514.89
## - log_NCD:sqrt_education 1 0.5046 113.55 -511.85
## <none> 113.04 -511.67
## + log_NCD:log_undernourishment 1 0.0001 113.04 -509.67
## - sqrt_education:log_undernourishment 1 20.2196 133.26 -446.53
##
## Step: AIC=-537.92
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_NCD:sqrt_education + sqrt_education:log_undernourishment
##
## Df Sum of Sq RSS AIC
## + log_tuberculousis:sqrt_education 1 4.3464 101.13 -553.09
## + log_tuberculousis:log_undernourishment 1 2.0098 103.47 -543.77
## + log_suicide 1 1.0408 104.44 -539.97
## + log_income 1 0.8034 104.68 -539.04
## <none> 105.48 -537.92
## - log_NCD:sqrt_education 1 0.5639 106.04 -537.75
## + log_NCD:log_undernourishment 1 0.2747 105.20 -536.99
## + log_tuberculousis:log_NCD 1 0.2617 105.22 -536.93
## - log_tuberculousis 1 7.5627 113.04 -511.67
## - sqrt_education:log_undernourishment 1 18.7319 124.21 -473.23
##
## Step: AIC=-553.09
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_NCD:sqrt_education + sqrt_education:log_undernourishment +
## sqrt_education:log_tuberculousis
##
## Df Sum of Sq RSS AIC
## + log_suicide 1 1.2563 99.877 -556.19
## - log_NCD:sqrt_education 1 0.1743 101.308 -554.39
## + log_income 1 0.7838 100.350 -554.26
## <none> 101.133 -553.09
## + log_NCD:log_undernourishment 1 0.2608 100.873 -552.14
## + log_tuberculousis:log_NCD 1 0.2469 100.887 -552.09
## + log_tuberculousis:log_undernourishment 1 0.0183 101.115 -551.16
## - sqrt_education:log_tuberculousis 1 4.3464 105.480 -537.92
## - sqrt_education:log_undernourishment 1 12.9780 114.111 -505.83
##
## Step: AIC=-556.19
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_NCD:sqrt_education +
## sqrt_education:log_undernourishment + sqrt_education:log_tuberculousis
##
## Df Sum of Sq RSS AIC
## + log_suicide:log_undernourishment 1 2.1311 97.746 -562.99
## + log_suicide:sqrt_education 1 1.7993 98.078 -561.61
## + log_tuberculousis:log_suicide 1 0.8870 98.990 -557.83
## - log_NCD:sqrt_education 1 0.1005 99.978 -557.78
## + log_income 1 0.8677 99.009 -557.75
## <none> 99.877 -556.19
## + log_tuberculousis:log_NCD 1 0.4314 99.446 -555.96
## + log_NCD:log_undernourishment 1 0.1147 99.762 -554.66
## + log_NCD:log_suicide 1 0.0192 99.858 -554.27
## + log_tuberculousis:log_undernourishment 1 0.0037 99.873 -554.20
## - log_suicide 1 1.2563 101.133 -553.09
## - sqrt_education:log_tuberculousis 1 4.5618 104.439 -539.97
## - sqrt_education:log_undernourishment 1 9.6341 109.511 -520.62
##
## Step: AIC=-562.99
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_NCD:sqrt_education +
## sqrt_education:log_undernourishment + sqrt_education:log_tuberculousis +
## log_undernourishment:log_suicide
##
## Df Sum of Sq RSS AIC
## + log_NCD:log_suicide 1 1.1485 96.597 -565.81
## - log_NCD:sqrt_education 1 0.0139 97.760 -564.93
## + log_tuberculousis:log_NCD 1 0.6738 97.072 -563.81
## + log_income 1 0.5306 97.215 -563.21
## + log_tuberculousis:log_suicide 1 0.4945 97.252 -563.06
## <none> 97.746 -562.99
## + log_suicide:sqrt_education 1 0.1646 97.581 -561.68
## + log_NCD:log_undernourishment 1 0.0579 97.688 -561.23
## + log_tuberculousis:log_undernourishment 1 0.0294 97.717 -561.11
## - log_undernourishment:log_suicide 1 2.1311 99.877 -556.19
## - sqrt_education:log_tuberculousis 1 4.2920 102.038 -547.46
## - sqrt_education:log_undernourishment 1 9.1677 106.914 -528.41
##
## Step: AIC=-565.81
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_NCD:sqrt_education +
## sqrt_education:log_undernourishment + sqrt_education:log_tuberculousis +
## log_undernourishment:log_suicide + log_NCD:log_suicide
##
## Df Sum of Sq RSS AIC
## + log_tuberculousis:log_suicide 1 1.1502 95.447 -568.70
## - log_NCD:sqrt_education 1 0.0173 96.615 -567.74
## + log_income 1 0.6548 95.943 -566.59
## + log_suicide:sqrt_education 1 0.5846 96.013 -566.29
## <none> 96.597 -565.81
## + log_tuberculousis:log_NCD 1 0.3704 96.227 -565.38
## + log_NCD:log_undernourishment 1 0.0569 96.541 -564.05
## + log_tuberculousis:log_undernourishment 1 0.0485 96.549 -564.02
## - log_NCD:log_suicide 1 1.1485 97.746 -562.99
## - log_undernourishment:log_suicide 1 3.2604 99.858 -554.27
## - sqrt_education:log_tuberculousis 1 3.9694 100.567 -551.38
## - sqrt_education:log_undernourishment 1 8.3386 104.936 -534.03
##
## Step: AIC=-568.7
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_NCD:sqrt_education +
## sqrt_education:log_undernourishment + sqrt_education:log_tuberculousis +
## log_undernourishment:log_suicide + log_NCD:log_suicide +
## log_tuberculousis:log_suicide
##
## Df Sum of Sq RSS AIC
## - log_NCD:sqrt_education 1 0.0133 95.461 -570.64
## + log_tuberculousis:log_NCD 1 0.6026 94.845 -569.28
## + log_income 1 0.5942 94.853 -569.25
## + log_suicide:sqrt_education 1 0.5804 94.867 -569.19
## <none> 95.447 -568.70
## + log_NCD:log_undernourishment 1 0.0302 95.417 -566.83
## + log_tuberculousis:log_undernourishment 1 0.0002 95.447 -566.70
## - log_tuberculousis:log_suicide 1 1.1502 96.597 -565.81
## - log_NCD:log_suicide 1 1.8043 97.252 -563.06
## - log_undernourishment:log_suicide 1 3.3048 98.752 -556.81
## - sqrt_education:log_tuberculousis 1 3.6598 99.107 -555.35
## - sqrt_education:log_undernourishment 1 7.9569 103.404 -538.03
##
## Step: AIC=-570.64
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + sqrt_education:log_undernourishment +
## sqrt_education:log_tuberculousis + log_undernourishment:log_suicide +
## log_NCD:log_suicide + log_tuberculousis:log_suicide
##
## Df Sum of Sq RSS AIC
## + log_income 1 0.6066 94.854 -571.24
## + log_tuberculousis:log_NCD 1 0.6030 94.858 -571.23
## + log_suicide:sqrt_education 1 0.5137 94.947 -570.84
## <none> 95.461 -570.64
## + log_NCD:log_undernourishment 1 0.0432 95.417 -568.83
## + log_NCD:sqrt_education 1 0.0133 95.447 -568.70
## + log_tuberculousis:log_undernourishment 1 0.0000 95.461 -568.64
## - log_tuberculousis:log_suicide 1 1.1543 96.615 -567.74
## - log_NCD:log_suicide 1 1.8020 97.262 -565.01
## - log_undernourishment:log_suicide 1 3.4165 98.877 -558.30
## - sqrt_education:log_tuberculousis 1 3.6699 99.130 -557.25
## - sqrt_education:log_undernourishment 1 9.3535 104.814 -534.50
##
## Step: AIC=-571.24
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment +
## sqrt_education:log_tuberculousis + log_undernourishment:log_suicide +
## log_NCD:log_suicide + log_tuberculousis:log_suicide
##
## Df Sum of Sq RSS AIC
## + log_tuberculousis:log_income 1 2.9058 91.948 -581.94
## + log_tuberculousis:log_NCD 1 0.4662 94.388 -571.25
## <none> 94.854 -571.24
## + log_income:log_undernourishment 1 0.3720 94.482 -570.85
## + log_suicide:sqrt_education 1 0.3549 94.499 -570.77
## - log_income 1 0.6066 95.461 -570.64
## + log_income:sqrt_education 1 0.1146 94.739 -569.74
## + log_income:log_suicide 1 0.1061 94.748 -569.70
## + log_tuberculousis:log_undernourishment 1 0.0298 94.824 -569.37
## + log_NCD:log_undernourishment 1 0.0093 94.845 -569.28
## + log_NCD:log_income 1 0.0077 94.846 -569.28
## + log_NCD:sqrt_education 1 0.0009 94.853 -569.25
## - log_tuberculousis:log_suicide 1 1.0905 95.944 -568.58
## - log_NCD:log_suicide 1 1.9225 96.776 -565.06
## - log_undernourishment:log_suicide 1 3.2360 98.090 -559.56
## - sqrt_education:log_tuberculousis 1 3.7161 98.570 -557.56
## - sqrt_education:log_undernourishment 1 9.5640 104.418 -534.05
##
## Step: AIC=-581.94
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment +
## sqrt_education:log_tuberculousis + log_undernourishment:log_suicide +
## log_NCD:log_suicide + log_tuberculousis:log_suicide + log_tuberculousis:log_income
##
## Df Sum of Sq RSS AIC
## + log_tuberculousis:log_NCD 1 1.9597 89.988 -588.73
## + log_income:log_undernourishment 1 1.3363 90.612 -585.91
## + log_NCD:log_income 1 0.5066 91.442 -582.19
## <none> 91.948 -581.94
## + log_tuberculousis:log_undernourishment 1 0.4121 91.536 -581.77
## + log_suicide:sqrt_education 1 0.2821 91.666 -581.19
## + log_income:log_suicide 1 0.1527 91.795 -580.62
## - sqrt_education:log_tuberculousis 1 0.8553 92.803 -580.16
## + log_income:sqrt_education 1 0.0294 91.919 -580.07
## + log_NCD:log_undernourishment 1 0.0130 91.935 -580.00
## + log_NCD:sqrt_education 1 0.0122 91.936 -579.99
## - log_tuberculousis:log_suicide 1 1.7881 93.736 -576.08
## - log_NCD:log_suicide 1 2.2577 94.206 -574.04
## - log_tuberculousis:log_income 1 2.9058 94.854 -571.24
## - log_undernourishment:log_suicide 1 3.3724 95.320 -569.24
## - sqrt_education:log_undernourishment 1 9.5632 101.511 -543.57
##
## Step: AIC=-588.73
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment +
## sqrt_education:log_tuberculousis + log_undernourishment:log_suicide +
## log_NCD:log_suicide + log_tuberculousis:log_suicide + log_tuberculousis:log_income +
## log_NCD:log_tuberculousis
##
## Df Sum of Sq RSS AIC
## + log_income:log_undernourishment 1 1.0203 88.968 -591.38
## + log_income:log_suicide 1 0.4932 89.495 -588.97
## <none> 89.988 -588.73
## + log_tuberculousis:log_undernourishment 1 0.2872 89.701 -588.03
## + log_suicide:sqrt_education 1 0.1456 89.843 -587.39
## + log_NCD:log_income 1 0.1111 89.877 -587.23
## + log_NCD:log_undernourishment 1 0.0649 89.923 -587.02
## + log_NCD:sqrt_education 1 0.0190 89.969 -586.81
## + log_income:sqrt_education 1 0.0003 89.988 -586.73
## - log_NCD:log_suicide 1 1.6067 91.595 -583.51
## - sqrt_education:log_tuberculousis 1 1.7712 91.760 -582.77
## - log_NCD:log_tuberculousis 1 1.9597 91.948 -581.94
## - log_tuberculousis:log_suicide 1 2.6337 92.622 -578.96
## - log_undernourishment:log_suicide 1 3.3644 93.353 -575.75
## - log_tuberculousis:log_income 1 4.3994 94.388 -571.25
## - sqrt_education:log_undernourishment 1 9.0513 99.040 -551.62
##
## Step: AIC=-591.38
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment +
## sqrt_education:log_tuberculousis + log_undernourishment:log_suicide +
## log_NCD:log_suicide + log_tuberculousis:log_suicide + log_tuberculousis:log_income +
## log_NCD:log_tuberculousis + log_undernourishment:log_income
##
## Df Sum of Sq RSS AIC
## + log_income:log_suicide 1 0.5554 88.413 -591.94
## <none> 88.968 -591.38
## + log_income:sqrt_education 1 0.3547 88.613 -591.01
## + log_tuberculousis:log_undernourishment 1 0.2331 88.735 -590.45
## + log_suicide:sqrt_education 1 0.1809 88.787 -590.21
## + log_NCD:log_undernourishment 1 0.0846 88.883 -589.77
## + log_NCD:sqrt_education 1 0.0174 88.951 -589.46
## + log_NCD:log_income 1 0.0057 88.962 -589.41
## - log_undernourishment:log_income 1 1.0203 89.988 -588.73
## - sqrt_education:log_tuberculousis 1 1.2640 90.232 -587.62
## - log_NCD:log_tuberculousis 1 1.6438 90.612 -585.91
## - log_NCD:log_suicide 1 1.6563 90.624 -585.85
## - log_tuberculousis:log_suicide 1 2.1914 91.159 -583.45
## - log_undernourishment:log_suicide 1 3.9055 92.874 -575.85
## - log_tuberculousis:log_income 1 5.1878 94.156 -570.26
## - sqrt_education:log_undernourishment 1 9.3941 98.362 -552.43
##
## Step: AIC=-591.94
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment +
## sqrt_education:log_tuberculousis + log_undernourishment:log_suicide +
## log_NCD:log_suicide + log_tuberculousis:log_suicide + log_tuberculousis:log_income +
## log_NCD:log_tuberculousis + log_undernourishment:log_income +
## log_suicide:log_income
##
## Df Sum of Sq RSS AIC
## + log_income:sqrt_education 1 0.7674 87.645 -593.49
## + log_tuberculousis:log_undernourishment 1 0.6398 87.773 -592.90
## <none> 88.413 -591.94
## - log_suicide:log_income 1 0.5554 88.968 -591.38
## + log_suicide:sqrt_education 1 0.2884 88.124 -591.27
## + log_NCD:log_undernourishment 1 0.0761 88.336 -590.29
## + log_NCD:sqrt_education 1 0.0522 88.360 -590.18
## + log_NCD:log_income 1 0.0355 88.377 -590.10
## - sqrt_education:log_tuberculousis 1 0.8826 89.295 -589.88
## - log_undernourishment:log_income 1 1.0825 89.495 -588.97
## - log_NCD:log_suicide 1 1.3099 89.722 -587.93
## - log_NCD:log_tuberculousis 1 1.9828 90.395 -584.89
## - log_tuberculousis:log_suicide 1 2.5905 91.003 -582.15
## - log_undernourishment:log_suicide 1 4.4591 92.872 -573.86
## - log_tuberculousis:log_income 1 5.6298 94.042 -568.75
## - sqrt_education:log_undernourishment 1 9.8874 98.300 -550.68
##
## Step: AIC=-593.49
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment +
## sqrt_education:log_tuberculousis + log_undernourishment:log_suicide +
## log_NCD:log_suicide + log_tuberculousis:log_suicide + log_tuberculousis:log_income +
## log_NCD:log_tuberculousis + log_undernourishment:log_income +
## log_suicide:log_income + sqrt_education:log_income
##
## Df Sum of Sq RSS AIC
## - sqrt_education:log_tuberculousis 1 0.3961 88.041 -593.65
## <none> 87.645 -593.49
## + log_suicide:sqrt_education 1 0.3707 87.274 -593.22
## + log_tuberculousis:log_undernourishment 1 0.3158 87.329 -592.96
## - sqrt_education:log_income 1 0.7674 88.413 -591.94
## + log_NCD:log_income 1 0.0330 87.612 -591.65
## + log_NCD:log_undernourishment 1 0.0106 87.635 -591.54
## + log_NCD:sqrt_education 1 0.0089 87.636 -591.53
## - log_suicide:log_income 1 0.9681 88.613 -591.01
## - log_NCD:log_suicide 1 1.0548 88.700 -590.61
## - log_NCD:log_tuberculousis 1 1.6562 89.301 -587.85
## - log_undernourishment:log_income 1 1.8094 89.455 -587.15
## - log_tuberculousis:log_suicide 1 2.7085 90.354 -583.07
## - log_undernourishment:log_suicide 1 4.7572 92.402 -573.93
## - log_tuberculousis:log_income 1 6.0929 93.738 -568.07
## - sqrt_education:log_undernourishment 1 6.7236 94.369 -565.34
##
## Step: AIC=-593.65
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment +
## log_undernourishment:log_suicide + log_NCD:log_suicide +
## log_tuberculousis:log_suicide + log_tuberculousis:log_income +
## log_NCD:log_tuberculousis + log_undernourishment:log_income +
## log_suicide:log_income + sqrt_education:log_income
##
## Df Sum of Sq RSS AIC
## + log_suicide:sqrt_education 1 0.4848 87.557 -593.90
## <none> 88.041 -593.65
## + log_tuberculousis:sqrt_education 1 0.3961 87.645 -593.49
## + log_NCD:log_income 1 0.1028 87.939 -592.13
## + log_tuberculousis:log_undernourishment 1 0.0269 88.014 -591.78
## + log_NCD:sqrt_education 1 0.0035 88.038 -591.67
## + log_NCD:log_undernourishment 1 0.0003 88.041 -591.65
## - log_NCD:log_suicide 1 1.1515 89.193 -590.35
## - sqrt_education:log_income 1 1.2540 89.295 -589.88
## - log_NCD:log_tuberculousis 1 1.3234 89.365 -589.56
## - log_suicide:log_income 1 1.4823 89.524 -588.84
## - log_undernourishment:log_income 1 2.6273 90.669 -583.65
## - log_tuberculousis:log_suicide 1 2.9166 90.958 -582.36
## - log_undernourishment:log_suicide 1 5.6335 93.675 -570.35
## - sqrt_education:log_undernourishment 1 7.8362 95.877 -560.86
## - log_tuberculousis:log_income 1 8.7134 96.755 -557.15
##
## Step: AIC=-593.9
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
## log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment +
## log_undernourishment:log_suicide + log_NCD:log_suicide +
## log_tuberculousis:log_suicide + log_tuberculousis:log_income +
## log_NCD:log_tuberculousis + log_undernourishment:log_income +
## log_suicide:log_income + sqrt_education:log_income + sqrt_education:log_suicide
##
## Df Sum of Sq RSS AIC
## <none> 87.557 -593.90
## - sqrt_education:log_suicide 1 0.4848 88.041 -593.65
## + log_tuberculousis:sqrt_education 1 0.2821 87.274 -593.22
## + log_NCD:log_income 1 0.1039 87.453 -592.39
## + log_NCD:sqrt_education 1 0.0771 87.479 -592.26
## + log_tuberculousis:log_undernourishment 1 0.0458 87.511 -592.12
## + log_NCD:log_undernourishment 1 0.0048 87.552 -591.93
## - log_NCD:log_tuberculousis 1 1.2451 88.802 -590.14
## - sqrt_education:log_income 1 1.3082 88.865 -589.85
## - log_NCD:log_suicide 1 1.4632 89.020 -589.14
## - log_suicide:log_income 1 1.6721 89.229 -588.19
## - log_undernourishment:log_suicide 1 2.1506 89.707 -586.00
## - log_undernourishment:log_income 1 2.7111 90.268 -583.46
## - log_tuberculousis:log_suicide 1 2.8991 90.456 -582.61
## - sqrt_education:log_undernourishment 1 7.5734 95.130 -562.06
## - log_tuberculousis:log_income 1 8.5091 96.066 -558.06
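The AIC values in the trace can be reproduced from the RSS column. For a linear model, `extractAIC()` (which `stepAIC()` relies on) computes n * log(RSS / n) + 2 * p, where p is the number of estimated coefficients. The sketch below assumes n = 408, inferred from the 401 residual degrees of freedom plus 7 coefficients in the main-effects summary; the helper name `aic_from_rss` is our own.

```r
# AIC as reported by stepAIC() for lm fits: n * log(RSS / n) + 2 * p
aic_from_rss <- function(rss, n, n_coef) {
  n * log(rss / n) + 2 * n_coef
}

n_obs <- 408  # assumption: 401 residual df + 7 coefficients

# Intercept-only model (RSS = 3566.0, one coefficient)
aic_start <- aic_from_rss(3566.0, n_obs, 1)
aic_start  # about 886.5, matching "Start: AIC=886.51"

# After adding log_NCD (RSS = 334.35, two coefficients)
aic_ncd <- aic_from_rss(334.35, n_obs, 2)
aic_ncd    # about -77.2, matching "Step: AIC=-77.22"
```

Each added term costs 2 AIC units, so a term is kept only when it reduces n * log(RSS / n) by more than that.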
Further, we draw interaction plots to visually check the interactions in the model that stepAIC selected as having the minimum AIC.
library(interactions)
lm_refit1=lm(life_expectancy ~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+sqrt_education:log_undernourishment, data = new_data)
interact_plot(lm_refit1, pred=sqrt_education, modx=log_undernourishment)
## Warning: 0.862584248762756 is outside the observed range of log_undernourishment
lm_refit2=lm(life_expectancy ~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_NCD:log_suicide, data = new_data)
interact_plot(lm_refit2, pred=log_NCD, modx=log_suicide)
lm_refit3=lm(life_expectancy ~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_suicide:log_undernourishment, data = new_data)
interact_plot(lm_refit3, pred=log_suicide, modx=log_undernourishment)
## Warning: 0.862584248762756 is outside the observed range of log_undernourishment
lm_refit4=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_tuberculousis:log_suicide, data = new_data)
interact_plot(lm_refit4, pred=log_tuberculousis, modx=log_suicide)
lm_refit5=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_tuberculousis:log_income, data = new_data)
interact_plot(lm_refit5, pred=log_tuberculousis, modx=log_income)
lm_refit6=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_NCD:log_tuberculousis, data = new_data)
interact_plot(lm_refit6, pred=log_NCD, modx=log_tuberculousis)
lm_refit7=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_undernourishment:log_income, data = new_data)
interact_plot(lm_refit7, pred=log_undernourishment, modx=log_income)
lm_refit8=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_suicide:log_income, data = new_data)
interact_plot(lm_refit8, pred=log_suicide, modx=log_income)
lm_refit9=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+sqrt_education:log_income, data = new_data)
interact_plot(lm_refit9, pred=sqrt_education, modx=log_income)
lm_refit10=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+sqrt_education:log_suicide, data = new_data)
interact_plot(lm_refit10, pred=sqrt_education, modx=log_suicide)
We observe that the lines are not parallel in the interaction plots for the following pairs: sqrt(education) and log(undernourishment), log(undernourishment) and log(suicide), log(NCD) and log(suicide), log(tuberculousis) and log(suicide), log(tuberculousis) and log(income), log(NCD) and log(tuberculousis), log(undernourishment) and log(income), log(suicide) and log(income), sqrt(education) and log(income), and sqrt(education) and log(suicide). Non-parallel lines suggest that the slope of one predictor depends on the level of the other, so we include these ten interactions as candidate terms in our multiple linear regression model.
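The visual check can be backed by a formal comparison. The following is a minimal sketch on simulated data, not the analysis above: the data frame `toy` and the variables `x`, `z`, and `y` are hypothetical. It shows that non-parallel lines correspond to a non-zero interaction coefficient, and that a nested-model F test can confirm what the plot suggests.

```r
# Simulated data with a true x:z interaction (all names are illustrative)
set.seed(42)
n <- 300
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + 0.5 * x + 0.3 * z + 0.8 * x * z + rnorm(n, sd = 0.5)
toy <- data.frame(y, x, z)

fit_main <- lm(y ~ x + z, data = toy)        # parallel lines: no interaction
fit_int  <- lm(y ~ x + z + x:z, data = toy)  # slope of x may change with z

# Slope of x at a given value of z: b_x + b_xz * z
b <- coef(fit_int)
slope_at <- function(z0) unname(b["x"] + b["x:z"] * z0)
slope_at(c(-1, 1))  # different slopes at low and high z give non-parallel lines

# F test comparing the two nested models
cmp <- anova(fit_main, fit_int)
cmp
```

A small p-value in the second row of `cmp` indicates that the interaction term improves the fit beyond what the main effects explain.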
VIF Values
We use VIF (variance inflation factor) values to measure the strength of the correlation among the independent variables in the regression. The goal is to avoid multicollinearity, which inflates the variance of the coefficient estimates and therefore the type II error rate.
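For a predictor j, the VIF equals 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on simulated data; `toy`, `x1`, `x2`, `x3`, and `vif_by_hand` are hypothetical names, and the response is omitted because the VIF depends only on the predictors.

```r
# Simulated predictors: x2 is strongly correlated with x1, x3 is independent
set.seed(7)
n <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# VIF of one predictor from its auxiliary regression on the others
vif_by_hand <- function(target, data) {
  others <- setdiff(names(data), target)
  aux <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- sapply(names(toy), vif_by_hand, data = toy)
round(vifs, 2)
```

Here `x1` and `x2` receive large values because each is well predicted by the other, while `x3` stays close to 1.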
f_lm1=lm(life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
log_suicide + log_income + sqrt_education:log_undernourishment +
log_undernourishment:log_suicide + log_NCD:log_suicide +
log_tuberculousis:log_suicide + log_tuberculousis:log_income +
log_NCD:log_tuberculousis + log_undernourishment:log_income +
log_suicide:log_income + sqrt_education:log_income + sqrt_education:log_suicide, data = new_data)
vif(f_lm1)
## log_NCD sqrt_education
## 21.414871 303.185493
## log_undernourishment log_suicide
## 244.869351 829.862161
## log_income sqrt_education:log_undernourishment
## 77.857803 7.959714
## log_undernourishment:log_suicide log_NCD:log_suicide
## 54.516802 699.145664
## log_suicide:log_tuberculousis log_income:log_tuberculousis
## 21.659061 49.025468
## log_NCD:log_tuberculousis log_undernourishment:log_income
## 69.211869 141.275009
## log_suicide:log_income sqrt_education:log_income
## 97.059594 318.407955
## sqrt_education:log_suicide
## 63.301258
f_lm2=lm(life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
log_income + sqrt_education:log_undernourishment +
log_undernourishment:log_suicide + log_NCD:log_suicide +
log_tuberculousis:log_suicide + log_tuberculousis:log_income +
log_NCD:log_tuberculousis + log_undernourishment:log_income +
log_suicide:log_income + sqrt_education:log_income + sqrt_education:log_suicide, data = new_data)
vif(f_lm2)
## log_NCD sqrt_education
## 6.705396 269.009692
## log_undernourishment log_income
## 241.248338 69.908005
## sqrt_education:log_undernourishment log_undernourishment:log_suicide
## 7.947199 53.760328
## log_NCD:log_suicide log_suicide:log_tuberculousis
## 126.264056 21.524340
## log_income:log_tuberculousis log_NCD:log_tuberculousis
## 48.713482 68.309443
## log_undernourishment:log_income log_income:log_suicide
## 140.201731 77.497715
## sqrt_education:log_income sqrt_education:log_suicide
## 303.288666 53.448179
f_lm3=lm(life_expectancy ~ log_NCD + sqrt_education + log_undernourishment +
log_income + sqrt_education:log_undernourishment +
log_undernourishment:log_suicide + log_NCD:log_suicide +
log_tuberculousis:log_suicide + log_tuberculousis:log_income +
log_NCD:log_tuberculousis + log_undernourishment:log_income +
log_suicide:log_income + sqrt_education:log_suicide, data = new_data)
vif(f_lm3)
## log_NCD sqrt_education
## 6.499978 43.231083
## log_undernourishment log_income
## 171.678544 25.238230
## sqrt_education:log_undernourishment log_undernourishment:log_suicide
## 6.016976 53.729823
## log_NCD:log_suicide log_suicide:log_tuberculousis
## 124.463091 21.483681
## log_income:log_tuberculousis log_NCD:log_tuberculousis
## 48.646500 68.297705
## log_undernourishment:log_income log_income:log_suicide
## 109.326006 74.271945
## sqrt_education:log_suicide
## 52.492097
f_lm4=lm(life_expectancy ~ log_NCD + sqrt_education +
log_income + sqrt_education:log_undernourishment +
log_undernourishment:log_suicide + log_NCD:log_suicide +
log_tuberculousis:log_suicide + log_tuberculousis:log_income +
log_NCD:log_tuberculousis + log_undernourishment:log_income +
log_suicide:log_income + sqrt_education:log_suicide, data = new_data)
vif(f_lm4)
## log_NCD sqrt_education
## 6.361508 42.842948
## log_income sqrt_education:log_undernourishment
## 15.334013 5.648248
## log_undernourishment:log_suicide log_NCD:log_suicide
## 39.046183 116.204892
## log_suicide:log_tuberculousis log_income:log_tuberculousis
## 21.330630 45.316191
## log_NCD:log_tuberculousis log_income:log_undernourishment
## 63.348578 26.964444
## log_income:log_suicide sqrt_education:log_suicide
## 74.082741 49.475127
f_lm5=lm(life_expectancy ~ log_NCD + sqrt_education +
log_income + sqrt_education:log_undernourishment +
log_undernourishment:log_suicide +
log_tuberculousis:log_suicide + log_tuberculousis:log_income +
log_NCD:log_tuberculousis + log_undernourishment:log_income +
log_suicide:log_income + sqrt_education:log_suicide, data = new_data)
vif(f_lm5)
## log_NCD sqrt_education
## 2.274173 39.277067
## log_income sqrt_education:log_undernourishment
## 14.265740 5.550315
## log_undernourishment:log_suicide log_suicide:log_tuberculousis
## 21.477249 17.190999
## log_income:log_tuberculousis log_NCD:log_tuberculousis
## 44.045708 63.263552
## log_income:log_undernourishment log_income:log_suicide
## 19.032463 36.644697
## sqrt_education:log_suicide
## 45.462863
f_lm6=lm(life_expectancy ~ log_NCD + sqrt_education +
log_income + sqrt_education:log_undernourishment +
log_undernourishment:log_suicide +
log_tuberculousis:log_suicide + log_tuberculousis:log_income +
log_undernourishment:log_income +
log_suicide:log_income + sqrt_education:log_suicide, data = new_data)
vif(f_lm6)
## log_NCD sqrt_education
## 1.981848 38.464524
## log_income sqrt_education:log_undernourishment
## 13.322987 5.536814
## log_undernourishment:log_suicide log_suicide:log_tuberculousis
## 21.015437 15.818216
## log_income:log_tuberculousis log_income:log_undernourishment
## 7.820429 18.960750
## log_income:log_suicide sqrt_education:log_suicide
## 34.966575 44.890353
f_lm7=lm(life_expectancy ~ log_NCD + sqrt_education +
log_income + sqrt_education:log_undernourishment +
log_undernourishment:log_suicide +
log_tuberculousis:log_suicide + log_tuberculousis:log_income +
log_undernourishment:log_income +
log_suicide:log_income , data = new_data)
vif(f_lm7)
## log_NCD sqrt_education
## 1.971445 11.159202
## log_income sqrt_education:log_undernourishment
## 8.014695 5.517505
## log_undernourishment:log_suicide log_suicide:log_tuberculousis
## 14.494598 15.600938
## log_income:log_tuberculousis log_income:log_undernourishment
## 7.761027 16.060096
## log_income:log_suicide
## 6.928652
f_lm8=lm(life_expectancy ~ log_NCD + sqrt_education +
log_income + sqrt_education:log_undernourishment +
log_undernourishment:log_suicide +
log_tuberculousis:log_suicide + log_tuberculousis:log_income +
log_suicide:log_income , data = new_data)
vif(f_lm8)
## log_NCD sqrt_education
## 1.970172 5.792327
## log_income sqrt_education:log_undernourishment
## 4.910514 2.289050
## log_undernourishment:log_suicide log_suicide:log_tuberculousis
## 6.727141 15.402695
## log_income:log_tuberculousis log_income:log_suicide
## 7.542750 5.683371
f_lm9=lm(life_expectancy ~ log_NCD + sqrt_education +
log_income + sqrt_education:log_undernourishment +
log_undernourishment:log_suicide +
log_tuberculousis:log_income +
log_suicide:log_income , data = new_data)
vif(f_lm9)
## log_NCD sqrt_education
## 1.957256 5.592539
## log_income sqrt_education:log_undernourishment
## 3.201881 2.117339
## log_undernourishment:log_suicide log_income:log_tuberculousis
## 5.600148 1.118468
## log_income:log_suicide
## 3.406178
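At each step above, we remove the term with the largest VIF value and refit the model, repeating until every remaining term has a VIF value below 10. Only the variables and interaction terms that survive this screening are kept in our final linear regression model.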
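The stepwise screening from f_lm1 to f_lm9 can also be written as a loop. This is a sketch under stated assumptions rather than the code used above: it runs on simulated data, `toy` and `vif_manual` are hypothetical names, and the VIF is taken from the diagonal of the inverse correlation matrix of the design columns, which is valid when every term contributes a single coefficient.

```r
# Simulated data: x2 is almost a copy of x1
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x3 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# VIF for each term (assumes every term is a single numeric column)
vif_manual <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  v <- diag(solve(cor(X)))
  names(v) <- attr(terms(fit), "term.labels")
  v
}

fit <- lm(y ~ x1 + x2 + x3, data = toy)
repeat {
  v <- vif_manual(fit)
  if (max(v) < 10 || length(v) <= 1) break
  worst <- names(which.max(v))                # term with the largest VIF
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}
round(vif_manual(fit), 2)
```

Writing the screening as a loop makes the stopping rule (all VIF values below 10) explicit and avoids retyping the formula at each step.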
final_lm = lm(life_expectancy ~ log_NCD + sqrt_education +
log_income + sqrt_education:log_undernourishment +
log_undernourishment:log_suicide +
log_tuberculousis:log_income +
log_suicide:log_income , data = new_data)
summary(final_lm)
##
## Call:
## lm(formula = life_expectancy ~ log_NCD + sqrt_education + log_income +
## sqrt_education:log_undernourishment + log_undernourishment:log_suicide +
## log_tuberculousis:log_income + log_suicide:log_income, data = new_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77881 -0.31544 0.02725 0.33786 1.92443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.907978 0.841728 74.737 < 2e-16 ***
## log_NCD -6.940823 0.112092 -61.921 < 2e-16 ***
## sqrt_education 0.081917 0.025634 3.196 0.00151 **
## log_income -0.005242 0.032256 -0.163 0.87097
## sqrt_education:log_undernourishment 0.084678 0.011754 7.204 2.92e-12 ***
## log_undernourishment:log_suicide -0.286667 0.024175 -11.858 < 2e-16 ***
## log_income:log_tuberculousis -0.012149 0.002478 -4.903 1.38e-06 ***
## log_income:log_suicide 0.035776 0.006310 5.670 2.74e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.523 on 400 degrees of freedom
## Multiple R-squared: 0.9693, Adjusted R-squared: 0.9688
## F-statistic: 1805 on 7 and 400 DF, p-value: < 2.2e-16
(1) Analysis of coefficients
From the summary table, we can analyse the coefficients which we are interested in:
A one-unit increase in the log of the number of NCD deaths is associated with an expected decrease of about 6.94 years in life expectancy at age 60, keeping everything else constant (coefficient -6.940823). It can also be interpreted as the effect of the log of the number of NCD deaths on life expectancy, controlling for the other 6 terms in the model.
A one-unit increase in the square root of the tertiary school enrollment rate is associated with an expected increase of about 0.082 years in life expectancy when the log of undernourishment is zero, keeping everything else constant (coefficient 0.081917). Because sqrt(education) also appears in an interaction with log(undernourishment), this slope changes with the level of undernourishment.
(2) Interaction Effects
The coefficient of the square root of School enrollment, tertiary (% gross) increases by 0.084678 for every unit increase in the log of the prevalence of undernourishment.
The coefficient of the log of the prevalence of undernourishment decreases by 0.286667 for every unit increase in the log of the age-standardized suicide mortality rate (per 100 000 population).
The coefficient of the log of the number of deaths due to tuberculosis (excluding HIV) decreases by 0.012149 for every unit increase in the log of per adult national income.
The coefficient of the log of the age-standardized suicide mortality rate (per 100 000 population) increases by 0.035776 for every unit increase in the log of per adult national income.
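As a worked example of these statements, the slope of sqrt(education) at a given level of log(undernourishment) is the main-effect coefficient plus the interaction coefficient times that level. The sketch below uses the two estimates from the summary table; the function name `slope_education` and the evaluation points 0, 1, and 2 are illustrative and need not lie within the observed range of log_undernourishment.

```r
# Estimates copied from the summary table of final_lm
b_education       <- 0.081917  # sqrt_education
b_education_under <- 0.084678  # sqrt_education:log_undernourishment

# Conditional slope of sqrt_education at a given log_undernourishment
slope_education <- function(log_undernourishment) {
  b_education + b_education_under * log_undernourishment
}

slope_education(c(0, 1, 2))
# 0.081917 0.166595 0.251273
```

Each additional unit of log(undernourishment) raises the education slope by 0.084678, which is what the interaction coefficient measures.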
(3) VIF Values
By checking the VIF values, we keep only log(NCD), sqrt(education), log(income), and the interaction terms sqrt(education):log(undernourishment), log(undernourishment):log(suicide), log(tuberculousis):log(income), and log(suicide):log(income), all of which have VIF values below 10. This means that these variables and interaction terms are not highly correlated with one another.
(4) P-values
The P-values of log(NCD), sqrt(education), and the four interaction terms (sqrt(education) with log(undernourishment), log(undernourishment) with log(suicide), log(tuberculousis) with log(income), and log(suicide) with log(income)) are all less than 0.05, which implies that these terms have significant effects on the response variable. The main effect of log(income) is the exception, with a P-value of 0.871.
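The same reading of the summary table can be done programmatically. This is a small sketch on simulated data, with hypothetical names `toy`, `x1`, `x2`, and `fit`; it pulls the p-value column out of the coefficient table and lists the terms below the 0.05 threshold.

```r
# Simulated data: x1 has a real effect, x2 has none
set.seed(3)
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 + rnorm(n)
toy <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = toy)

# Coefficient table as a matrix; the last column holds the p-values
coef_table  <- coef(summary(fit))
p_values    <- coef_table[, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
significant
```

Applied to final_lm, the same three lines would return every term except log_income, given the p-values printed in the summary above.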
(5) Limitations
One limitation is the problem of missing data in the education variable (School enrollment, tertiary (% gross)) and negative values in the income variable (Per adult national income). These missing or negative values are a problem because, after we apply the non-linear transformations (log() to income and sqrt() to education), the dataset contains a large number of NaN and -Inf values.
To get around this problem, we tried to approximate the missing education data by interpolation, that is, by finding a function that describes how education changes over time. The idea is that countries with missing data in the four years we focus on still have data available in other years. If we could find a general trend in how the education value changes from year to year, we could calculate an estimated value for each missing observation. However, after plotting the known education values of twenty randomly selected countries and repeating the process three times, we failed to find a suitable function for the interpolation.
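To make the interpolation idea concrete, here is a minimal sketch of the simplest version, linear interpolation between known years, using base R's `approx()`. The series is hypothetical and is not taken from the schooling data; the approach presumes that enrollment changes roughly linearly between observed years.

```r
# Hypothetical enrollment series for one country with gaps
years      <- 2000:2005
enrollment <- c(10, NA, 14, NA, NA, 20)

# Interpolate linearly between the observed points
known  <- !is.na(enrollment)
filled <- approx(x = years[known], y = enrollment[known], xout = years)$y
filled
# 10 12 14 16 18 20
```

With its default settings, `approx()` returns NA for years outside the range of the known points, so this fills gaps but does not extrapolate.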
Below are three plots of known education values:
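library(tidyverse)
library(readxl)
# Read the school enrollment sheet (one row per country, one column per year)
school = read_xls("schooling.xls",sheet=2)
# Count the non-missing enrollment values per country in each period.
# (val*0+1) is 1 when val is observed and NA otherwise, so na.rm=T skips missing years.
school %>% pivot_longer(`1960`:last_col(),names_to="year",values_to="val") %>%
group_by(`Country Name`) %>%
summarise(pre2000 = sum((year<2000)*(val*0+1),na.rm=T),
btw2000.2010 = sum((year>2000)*(year<2010)*(val*0+1),na.rm=T),
btw2010.2015 = sum((year>2010)*(year<2015)*(val*0+1),na.rm=T),
post2015 = sum((year>2015)*(val*0+1),na.rm=T)) -> school.ys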
# par(mfrow) only affects base graphics, so each ggplot below is still drawn as its own figure
par(mfrow = c(3,1))
school %>%
pivot_longer(`1960`:last_col(),names_to="year",values_to="val") %>%
filter(`Country Name` %in% sample(unique(`Country Name`),20)) %>%
ggplot(aes(x=as.numeric(year),y=val,color=`Country Name`))+geom_point()+geom_line()
## Warning: Removed 696 rows containing missing values (geom_point).
## Warning: Removed 582 row(s) containing missing values (geom_path).
school %>%
pivot_longer(`1960`:last_col(),names_to="year",values_to="val") %>%
filter(`Country Name` %in% sample(unique(`Country Name`),20)) %>%
ggplot(aes(x=as.numeric(year),y=val,color=`Country Name`))+geom_point()+geom_line()
## Warning: Removed 583 rows containing missing values (geom_point).
## Warning: Removed 447 row(s) containing missing values (geom_path).
school %>%
pivot_longer(`1960`:last_col(),names_to="year",values_to="val") %>%
filter(`Country Name` %in% sample(unique(`Country Name`),20)) %>%
ggplot(aes(x=as.numeric(year),y=val,color=`Country Name`))+geom_point()+geom_line()
## Warning: Removed 734 rows containing missing values (geom_point).
## Warning: Removed 525 row(s) containing missing values (geom_path).
We can see that these data follow no fixed trend. Some of the lines are linear, some are curved, and some are not even monotone (i.e., the school enrollment value of a country may first increase, then decrease, and then increase again). Without a good function, approximating the missing data may introduce a larger bias. With this in mind, we chose to delete each country-year observation with a missing education value.
In addition, since the log of a negative value is undefined, which would cause a problem in the linear regression model, we also chose to delete each country-year observation with a negative income value.
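The deletion step can be sketched as follows. The data frame `toy` and its values are hypothetical, and the condition `income > 0` is an assumption that also removes zero incomes, since log(0) is -Inf.

```r
# Hypothetical country-year data with one missing education value
# and one negative income value
toy <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(2000, 2010, 2000, 2010),
  education = c(25, NA, 40, 55),
  income    = c(12000, 15000, -300, 20000)
)

# What the transformations return on problematic inputs
suppressWarnings(log(-300))  # NaN
log(0)                       # -Inf
sqrt(NA_real_)               # NA

# Keep only rows where both transformations are well defined
clean <- subset(toy, !is.na(education) & income > 0)
clean$sqrt_education <- sqrt(clean$education)
clean$log_income     <- log(clean$income)
clean
```

Filtering before transforming keeps NaN and -Inf values out of the model data altogether, at the cost of a smaller sample.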
We collected data on life expectancy and on the six explanatory variables we are interested in. By constructing a multiple linear regression model and an anova model, we find that log(NCD), sqrt(education), and the interaction terms of sqrt(education) and log(undernourishment), log(undernourishment) and log(suicide), log(tuberculousis) and log(income), and log(suicide) and log(income) have significant effects on life expectancy at age 60 (the response variable). log(income) is retained in the final model, but its main effect alone is not significant (p = 0.871); income enters significantly only through its interactions with tuberculosis and suicide.